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r-H i.e. when features are correlated. First, we introduce a pooled centroids formu- 

^ ^ lation of the multi-class LDA predictor function, in which the relative weights of 

Mahalanobis-transformed predictors are given by correlation-adjusted f-scores (cat 
scores). Second, for feature selection we propose thresholding cat scores by control- 
ling false non-discovery rates (FNDR). Third, training of the classifier is based on 
I— I James-Stein shrinkage estimates of correlations and variances, where regularization 

Oh parameters are chosen analytically without resampling. Overall, this results in an 

effective and computationally inexpensive framework for high-dimensional predic- 
^ tion with nattiral feature selection. The proposed shrinkage discriminant procedvires 

^ are implemented in the R package "sda" available from the R repository CRAN. 
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1 Introduction 



Class prediction of biological samples based on their genetic or proteomic profile is now 
a routine task in genomic studies. Accordingly, many classification methods have been 
developed to address the specific statistical challenges presented by these data - see, e.g., 
Schwender et al. ( 2008| l and Slawski et al. ( 2008| l for recent reviews. In particular, the 



small sample size n renders difficult the training of the classifier, and the large number 
of variables p makes it hard to select suitable features for prediction. 

Perhaps surprisingly, despite the many recent innovations in the field of classification 
methodology, including the introduction of sophisticated algorithms for support vector 
machines and the proposal of ensemble methods such as random forests, the concep- 
tually simple approach of linear discriminant analysis (LDA) and its sibling, diagonal 
discriminant analysis (DDA), remain among the most effective procedures also in the 
domain of high-dimensional prediction (Efron 2008a Hand[ 2006 [Efron |1975) . 

In order to be applicable for high-dimensional analysis, it has been recognized 
early that regularization is essential (Friedman 1989). Specifically, when training the 
classifier, i.e. when estimating the parameters of the discriminant function from training 
data, particular care needs to be taken to accurately infer the (inverse) covariance 
matrix. A rather radical, yet highly effective way to regularize covariance estimation 
in high dimensions is to set all correlations equal to zero (Bickel and Levina 2004 1. 
Employing a diagonal covariance matrix reduces LDA to the special case of diagonal 
discriminant analysis (DDA), also known in the machine learning community as "naive 
Bayes" classification. 

In addition to facilitating high-dimensional estimation of the prediction function, 
DDA has one further key advantage: it is straightforward to conduct feature selection. 
In the DDA setting with two classes {K = 2) it can be shown that the optimal criterion for 
ordering features relevant for prediction are the f-scores between the two group means 
(e.g., [Fan and Fant[2008| , or in the multi-class setting, the t-scores between group means 
and the overall centroid. 

The nearest shrunken centroids (NSC) algorithm ( [Tibshirani et al.[ 2002 2003| , com- 
monly known by the name of "PAM" after its software implementation, is a regularized 
version of DDA with multi-class feature selection. The fact that PAM has established 
itself as one of the most popular methods for classification of gene expression data is am- 
ple proof that DDA-type procedures are indeed very effective for large-scale prediction 
problems - see also Bickel and Levina ( |2004 > and Efron ( 2008a| >. 

However, there are now many omics data sets where correlation among predictors 
is an essential feature of the data and hence cannot easily be ignored. For example, 
this includes proteomics, imaging, and metabolomics data where correlation among 
biomarkers is commonplace and induced by spatial dependencies and by chemical 
similarities, respectively. Furthermore, in many transcriptome measurements there 



are correlations among genes within a functional group or pathway ( Ackermann and 



Strimmer.2009) . 



Consequently, there have been several suggestions to generalize PAM to account for 
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correlation. This includes the SCRDA (Guo et al. 2007| , Clanc (Dabney and Storey 2007| 
and MLDA (Xu et al. 2009 1 approaches. All these methods are regularized versions of 
LDA, and hence offer automatic provisions for gene-wise correlations. However, in con- 
trast to PAM and DDA, they lack an efficient and elegant feature selection scheme, due 
to problems with multiple optima in the choice of regularization parameters (SCRDA) 
and the large search space for optimal feature subsets (Clanc). 

In this paper, we present a framework for efficient high-dimensional LDA analysis. 
This is based on three cornerstones. First, we employ James-Stein shrinkage rules for 
training the classifier. All regularization parameters are estimated from the data in an 
analytic fashion without resorting to computationally expensive resampling. Second, we 
use correlation-adjusted f-scores (cat scores) for feature selection. These scores emerge 
from a restructured version of the LDA equations and enable simple and effective 
ranking of genes even in the presence of correlation. Third, we employ false non- 
discovery rate thresholding for selecting features for inclusion in the prediction rule. As 
we will show below, this is a highly effective method with similar performance to the 
recently proposed "higher criticism" approach. 

The remainder of the paper is organized as follows. In Sections 2-5 we detail our 
framework for shrinkage discriminant analysis and variable selection. Subsequently, 
we demonstrate the effectiveness of our approach by application to a number of high- 
dimensional genomic data sets. We conclude with a discussion and comparison to 
closely related approaches. 



2 Linear discriminant analysis revisited 
2.1 Standard formulation 

LDA starts by assuming a mixture model for the p-dimensional data x 

fix) = f; TTjfixlj), 

where each of the K classes is represented by a multivariate normal density 

f{x\k) = (27r)-P/2|5;|-i/2 X 

exp{-2(^ - Fkf'^'\^ - H)} 

with group-specific centroids and a common covariance matrix E. The probability 
of group k given x is computed from the a priori mixing weights TTy by application of 
Bayes' theorem. 
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We define here the LDA discriminant score as the log posterior d]^{x) = log{Pr(A;|x)}, 
which after dropping terms constant across groups becomes 

= rilZ-'x - + log(7r,) . (1) 

Due to the common covariance, (i^^^(x) is linear in x. Prediction in LDA works by 
evaluating the discriminant function at the given test sample x for all possible k, choosing 
the class maximizing the posterior probability (and hence d^^^). 



2.2 Pooled centroid formulation 

We now rewrite the standard form of the LDA predictor function (Eq.|l| with the aim to 
elucidate the influence of each individual variable in prediction. Specifically, we simply 
add a class-independent constant to the discriminant function - note that this does not 
change in any way the prediction. We compute the pooled mean 

_ ^ 

representing the overall centroid (ny is the sample size in group j and n = Y^f^i the 
total number of observations) and the corresponding discriminant score 

"pool {■''■) — Hpool^ * 2 "pool "pool ■ 

The centered score 

can be interpreted as log posterior ratio and is in terms of prediction completely equiva- 
lent to the original d^^^(x). After some careful algebra it simplifies to 

Air°A(x) = a;[j,(x)+log(7r,) (2) 

with feature weight vector 

iVk = P-'''V-''HF,-Fpooi) (3) 
and Mahalanobis-transformed predictors 

S,{x) = p-i/2y-i/2(^ _ ^V^Vli )_ (4) 

Here, we have made use of the variance-correlation decomposition of the covariance 
matrix E = V^^^PV^^^, where V = diagjcr^, . . . , (7^} is a diagonal matrix containing the 
variances and P = (pij) is the correlation matrix. 

A remarkable property of the above restructuring (Eq. |2]-Eq. of the LDA discrimi- 
nant function (Eq. l|l is that both tOj^ and Si^{x) are vectors and not matrices. Furthermore, 
note that a;;^ is not a function of the test data x and that its components control how 
much each individual variable contributes to the score A^°^ of group k. 
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2.3 James-Stein shrinkage rules for learning the LDA predictor 

In order to train the LDA discriminant function (Eq. [l] or Eq. |2]) we estimate group 
centroids by their empirical means, and otherwise rely on three different James-Stein- 
type shrinkage rules. Specifically, we employ 



1. for the correlations P the ridge-type estimator from Schafer and Strimmer ( 2005[ l, 



2. for the variances V the shrinkage estimator from Opgen-Rhein and Strimmer 
( [2007| |, and 



3. for the proportions tt/^ the frequency estimator from|Hausser and Strimrrier (2009 1. 



All three James-Stein-type estimators are constructed by shrinking towards suitable 
targets and analytically minimizing the mean squared error. The precise formulas are 
given in Appendix A. For the statistical background we refer to the above mentioned 
references. 

We remark that the advantages of using James-Stein rules for data analysis have 
recently become (again) more appreciated in the literature, especially in the "small n, 
large p" setting, where James-Stein-type estimators are very efficient both in a statistical 
as well as in a computational sense. In training of the LDA predictor function by 
James-Stein shrinkage we follow Dabney and Storey ( 2007[ | and |Xu et al. ( 2009[ l, who 
give a comprehensive comparison with competing approaches such as support vector 
machines. [Slawski et al. ( |2008j l also implement a shrinkage version of LDA. 



3 Feature selection 



3.1 A natural variable selection score for LDA 



Zuber and Strimmer 



Following 

f-scores (cat scores) to be a scaled version of the feature weight vector cof. 



( 2009 1 we define the vector r'f^ of correlation-adjusted 



adj 



p-l/2 
p-1/2 



-1/2 



(5) 



nk 



The vector Xk contains the gene-wise gene-specific t-scores between the mean of group 

k and the pooled mean. Thus, the correlation-adjusted i-scores (rlf^) are decorrelated 

f -scores (rjt). If there is no correlation r'f' reduces to Xk- The facto r — \)^^^^ in Eq. 5 
standardizes the error of fi^. — and is the same as in PAM ( ^Tibshirani et al. 2003j r 

Note the minus sign, which is due to correlation s/rij^/n between p.^^ and /ipooj |^ 



^ The plus sign in the original PAM paper |Tibshirani et al. [2002 p. 6567) is a typographic error. 
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In DDA approaches, such as PAM, regularized estimates of the f-scores Tj^ are 



employed for feature selection. From Eqs. 



2-4 



it follows directly that the cat scores r'f^ 



provide the most natural generalization in the LDA setting (see also Remark A). 
As a summary score to measure the total impact of feature i G {l,...,p} we use 

s, = E(<o^ (6) 

i.e. the squared z-th component of the cat score vector r'f^ = {t'^I, ■ ■ ■ , T^p^lY summed 
across the K groups. For comparison, PAM uses the criterion 

max(|T,-,|). (7) 

7-i,...,K 

Using the squared sum of the group-specific cat scores in S, rather than taking the 
maximum over the absolute values as in Sj has two distinct advantages. First, the sample 
distribution of the estimated S, is more tractable, being approximately ](^. Second, if a 
feature is discriminative with regard to more than one group this additional information 
is not disregarded. 

3.2 Feature selection by controlling the false non-discovery rate 

When constructing an efficient classifier it is desirable to eliminate features that pro- 
vide no useful information for discriminating among classes. The conventional but 
computationally tedious approach is to choose the optimal threshold by estimating the 
prediction error via cross-validation along a grid of possible threshold values. Faster 
alternative thresholding procedures include "higher criticism" I Donoho and Jin 2008| l, 



FAIR" ( [Fan and FanH2008| l, and "Ebay" ( |Efron[|2008bl l. The latter two methods are 



primarily developed with the correlation-free setting and i-scores in mind (however, 
"Ebay" also offers correlation corrections for prediction errors). 

Here, we advocate using the false discovery rate (FDR) framework to select features 
for classification. We emphasize, however, that in the problem of constructing classifiers 
the FDR approach can not be applied in the same fashion as in differential expression. 
In the latter case, the aim is to compile a set of genes one has confidence in to be 
differentially expressed. This is controlled by the FDR criterion. In contrast, when 
furnishing classifiers one aims at identifying with confidence the set of null features that 
are not informative with regard to group separation, in order to eliminate them from the 
classifier. This is controlled by the false non-discovery rate, FNDR. For a discussion of 
the relation between FDR and FNDR see, e.g., Strimmer] ( 2008| . 



This subtle but important distinction is best illustrated by referring to Fig. [T| which 
plots the local FDR fdr(S/) = Prob("nuH"|S,) computed for (and from) the statistic S,- of 
feature /. In a list of differentially expressed genes we decide to include, say, genes i with 
fdr(S,) < 0.2. A similar constraint on the local false non-discovery rate, fndr(S,) < 0.2, 
gives a confidence set of the null genes. The local false discovery and local false non- 
discovery rates add up to one, fndr(S,) = 1 — fdr(S,). Hence, the set of features to be 



6 



1.0 
0.8 

local 

FDR 0.6 

0.4 
0.2 
0.0 









"null" features 




"non-null" features 


(FNDR control) 




(FDR control) 















variables included in the predictor 



Figure 1: Local false discovery rates as a function of the summary score S,. There are 
three distinct areas: an acceptance and a rejection zone, which are seperated by a "buffer 
zone" in the middle. Note that the features to be included in the classifier by FNDR 
control of the null genes form a superset of the differentially expressed genes controlled 
by FDR. 



retained in the classifier have local false discovery rates smaller than 0.8 - instead of 
0.2. Thus, the features included in the predictor form a superset of the differentially 
expressed variables. A similar argument applies when using distribution-based Fdr 
(ij-values) and Fndr values. 

In short, our proposal is to identify the null features by controlling (local) FNDR, and 
subsequently using all features except the identified null set in prediction. For estimating 
FDR quantities we use the semiparametric approach outlined in Strimmer ( 2008| . Note 
that this and other FDR procedures assume that there are enough null features so that 
the null model can be properly estimated ( [Efron 20041. 
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Table 1: The general feature selection score S, and special cases thereof. 



Si = 


K arbitrary 


K = 2 


correlation present: 


L;=l 






no correlation (P = I): 




'J 


2tI 



4 Special cases 

4.1 Two groups 

For K = 2 the cat score r'f^ between the group centroid and the pooled mean reduces to 
the cat score between the two means: 

^ 111 ni+n2 

= p-i/2x{(l + l)y}-V2( ). 
Ill n2 

Note that r'f^ = The feature selection score S, reduces to the squared cat score be- 

tween the two means, cf. Tab.jlj Likewise, for K = 2 the difference A^°^(x) — A2^^(x) re- 
duces to a;^<y(x) + log(^) with a; = p-^^^V-^/^{fi^-fi2) andS{x) = p-^/^y-i/i^^ _ 



For extensive discussion of the two group case, including comparison of gene rank- 
ings with many other test statistics, we refer to Zuber and Strimmer (2009||. 



4.2 Vanishing correlation 

In case of no correlation (P = I) the cat scores reduce (by construction) to standard 
t-scores between the two centroids of interest, either between the group and the pooled 
mean (general K) or between the two groups {K = 2). The gene summary S, reduces to 
the sum of the respective squared t-scores (Tab.[l|. The discriminant function reduces to 
the standard form of diagonal discriminant analysis. The pooled centroids formulation 
of LDA reduces to that of PAM (except for the shrinkage of the means present in PAM 
but not in our approach). 
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5 Remarks 



Remark A: Definition of feature weights 

The definition of feature weights according to Eq. |3] is most natural. Other ways of 
splitting up the product ii>lS]^{x) lead to various inconsistencies. For example, instead 
of using iO] ^ = Z^^^^{fij^ - ;ipooi) i t has been suggested to consider E"^ (;/^ - fip^Qi), for 
example in Witten and Tibshirani] ( |200 9), page 627. However, this choice implies that for 



the case of no correlation variable selection would be based on [ji^. — fip^gi) rather 
than on i-scores. 

Furthermore, dividing the inverse correlation equally between Eq.jsjand Eq.ji] 
greatly simplifies interpretation: feature selection takes place on the level of centered, 
scaled as well as decorrelated predictors Sk{x). Note that this interpretation is not 
hampered by the fact the decorrelation involves all features, because typically there is no 
substantial correlation between non-null and null features, so that the overall correlation 
matrix decomposes into correlation within null and within non-null variables. 

Remark B: Grouping of features 

Using cat scores for feature selection also greatly facilitates the grouping of features. 
Specifically, adding the squared cat scores of each feature contained in a gi ven set (e.g 



gene sets specified by biochemical pathways) leads to Hotelling's T , see Zuber and 
Strimmer ( 2009[ |. Note that if another decomposition than that of Eq. |3]and Eq. |4]was 



used the connection of cat scores with Hotelling's T would be lost 
Remark C: FDR methods for feature selection 

The usefulness of false discovery rates for feature selection in prediction is disputed. 



e.g., in Donoho and Jin (2008 1. What we show here is that the unfavorable performance 
is due to naive application of FDR, leading to the elimination of too many predictors. 
If instead FNDR is controlled to determine the null-features to be excluded from the 
discriminant function, then much more efficient prediction rules are obtained. 

Remark D: Fast computation of matrix square root 

The inverse square root of the correlation matrix, required in Eqs. [T|and[5| can be com 



puted very efficiently for the James-Stein shrinkage estimator, see [Zuber and Strimmer 
( |2009] ) for details. 

Remark E: Normalizing the null model 

Estimating false discovery rates using summary scores S, (Eq.|6]l assumes as null model 
a ;\;'^-distribution with unknown parameters. To employ standard FDR software we 
apply the cube-root transformation, which provides a normalizing transform for the 



-distribution (Wilson and Hilferty 1931 
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Figure 2: Violin plots of prediction error rates of various classification methods for the 
Singh et al. (2002) data. The violin plot is a generalization of the box plot, showing the 
median and upper and lower quartiles, as well as the density. Underlying each plot are 
200 estimates of prediction error computed from the 200 splits arising from balanced 
10-fold cross-validation with 20 repetitions. The number in round brackets indicates the 
number of selected features. See also Tab.lH 



6 Results 



We now illustrate our shrinkage DDA and LDA approaches with variable selection 
using cat scores and FNDR control by analyzing a number of reference data examples, 
and compare our results with that of competing approaches. We also investigate the 
performance of FNDR feature selection in comparison with that of "higher criticism" 
Ponoho ar\djinl[2008l l. 



6.1 Singh et al. (2002) gene expression data 



First, we investigated the prostate cancer data set of Singh et al. ( 2002| . This consists 
of gene expression measurements of p = 6033 genes for n = 102 patients, of which 52 
are cancer patients and 50 are healthy (thus K = 2). To facilitate cross-comparison we 
analyzed the data exactly in form as used in Efron (|2008a[|. Ou r results are summarized 
in Tab.|2j and corresponding violin plots ( Hintze and Nelsonj 1998| l are shown in Fig.|2] 
Initially, we assumed zero correlation and applied the shrinkage DDA method. By 
controlling the local FNDR to be smaller or equal than 0.2 we determined that 5867 
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Table 2: Prediction errors and number of selected features for Singh et al. (2000) gene 
expression data. The number in the round brackets is the estimated standard error. 

Method Prediction Error Features 



Ebay 


0.092 


51 


DDA-FDR 


0.1682 (0.0093) 


53 


LDA-FDR 


0.0989 (0.0056) 


62 


LDA-FNDR 


0.0550 (0.0048) 


131 


DDA-FNDR 


0.0640 (0.0049) 


166 


PAM 


0.0859 (0.0063) 


172-482 


DDA-ALL 


0.3327 (0.0099) 


6033 



The prediction error of Ebay is taken from Efron ( 2008a) |. 



genes were null genes, hence that 166 genes needed to be included in the prediction 
rule. For comparison, a local FDR cutoff on the same level yielded only 53 genes, lacking 
the 103 genes in the "buffer zone" between the two cutoffs (cf. Fig. Note that we 
recommend using the larger FNDR-based feature set, not just the 53 genes considered to 
be differentially expressed. 

We estimated the prediction error of the resulting classification rule using balanced 
10-fold cross-validation with 20 repetitions. For each of the in total 200 splits we trained 
a new prediction rule and estimated new feature rankings and FDR statistics, thereby 
including the selection process in the error estimate to avoid overoptimistic results 
( [Ambroise and ]V[cLachlan[|2002| ). The number of selected features shown in Tab.|2]is 
based on the complete data. However, for estimation of prediction error for each of the 
splits a new set of features was determined. 

For the FNDR-based cutoff with 166 included features we obtained an estimate of 
the prediction error of 0.0640 whereas for the naive FDR cutoff resulting in 53 predictors 
the error is much higher (0.1682). For comparison, we also computed the error using all 
6033 features, yielding a massive 0.3327. The PAM program selected between 172 and 
482 genes for inclusion in its predictor with error rate 0.0859 (note that the number of 
selected features by the PAM algorithm is highly variable and differs from run to run 
even for the same data set). According to |Efron (2008a > the Ebay approach used 51 genes 
for prediction with error rate 0.092. 

If correlation was taken into account, i.e. if the order of ranking was determined 
by cat rather than t-scores, interestingly both the number of differentially expressed 
genes and of the null genes increases, implying that the "buffer zone" shown in Fig. |l] 
becomes smaller. Thus, the LDA classifier with FNDR cutoff contained for this data 
fewer predictors (131) but at the same time nevertheless achieved the smallest overall 
prediction error (Fig.|2]). 
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Table 3: Estimated prediction errors for several multi-class reference data sets. 



Data 




Method 


Prediction Error 


Features 


DE 


Lymphoma 




DDA-FNDR 


0.0517 (0.0062) 


162 





(K = 3,n = 


62, 


LDA-FNDR 


0.0036 (0.0018) 


392 


55 


p = 4026) 




PAM 


0.0254 (0.0045) 


2796-3201 




SRBCT 




DDA-FNDR 


0.0007 (0.0007) 


90 


62 


(K = 4,n = 


63, 


LDA-FNDR 


0.0000 (0.0000) 


89 


76 


p = 2308) 




PAM 


0.0145 (0.0034) 


39-87 




Brain 




DDA-FNDR 


0.1892 (0.0146) 


33 


8 


(K = 5,n = 


42, 


LDA-FNDR 


0.1525 (0.0120) 


102 


23 


p = 5597) 




PAM 


0.1939 (0.0112) 


197-5597 





The last column (DE) shows the number of differentially expressed genes, which equals 
the number of significant features if FDR rather than FNDR is used as criterion. 



6.2 Performance for multi-class reference data sets 

For extended comparison we applied our approach to a number of further reference 



data sets. In particular we analyzed gene expression data for lymphoma ( Alizadeh 



et al. 2000| l, small round blue cell tumors (SRBCT) (B Chan et al.[|200l) and brain cancer 



(Pomeroy et al. 2002j|. The data sets have in common that all contain more than two 



classes, thus allowing to study the multi-class summary statistic (Eq.|6]l. A summary of 
the results obtained by shrinkage LDA/DDA and FNDR feature selection and by PAM 
is give n in Tab.[3| 



The Khan et al. ( |2001| data are very easy to classify. All methods performed equally 
well on this data, with no substantial difference between the LDA and DDA approaches. 

For the lymphoma data set the PAM approach failed to identify a compact set of 
predictive features. In contrast, the FNDR approach selects a comparatively small 
number of genes both in the LDA and DDA case. Intriguingly, for this data there were 
no differentially expressed genes, if correlation is ignored, yet the FNDR criterion yielded 
162 non-null features. 

The brain data set is the largest and most difficult data set. Again, the PAM approach 
failed to determine a stable set of features, whereas FNDR control yielded a compact set 
of informative predictors. Here, as well as for the lymphoma data, the LDA approach 
clearly outperformed the DDA approaches in terms of prediction error. 



6.3 Comparison with "higher criticism" feature selection 

Using the data examples above we demonstrated that feature selection based on simple 
FDR cutoffs is not sufficient for prediction. In particular, if features are weak and sparse 
it may easily happen that no predictor has sufficiently small false discovery rate to be 
called significant (cf. the lymphoma data). 



12 



Table 4: Estimated prediction errors employing higher criticism as feature selection 
criterion. 



Data 


Method 


Prediction Error 


Features 


local FDR 


Prostate 


DDA-HC 


0.0707 (0.0055) 


129 


0.69 




LDA-HC 


0.0497 (0.0045) 


116 


0.73 


Lymphoma 


DDA-HC 


0.0185 (0.0038) 


179 


1.00 




LDA-HC 


0.0000 (0.0000) 


345 


0.78 


SRBCT 


DDA-HC 


0.0035 (0.0016) 


138 


1.00 




LDA-HC 


0.0007 (0.0007) 


174 


1.00 


Brain 


DDA-HC 


0.1572 (0.0118) 


33 


0.77 




LDA-HC 


0.1417 (0.0108) 


131 


1.00 



The last column (local FDR) shows the local FDR of the least significant feature. 



In such a setting Donoho and Jin (2008 1 suggest as alternative to FDR-based thresh- 
olding the "higher criticism" (HC) approach. The HC criterion is based on p-values. For 
each feature, the p-value is centered and standardized using the estimated mean and 
variance of the corresponding order statistic. The optimal threshold is determined as 
the maximum of the absolute HC scores within a fraction (say 10%) of the top ranking 
features ( |Donoho and Jin} [2008) 1. 

Our feature selection approach based on FNDR control shares with HC that we 
aim to overcome the limitations resulting from naive application of FDR-based feature 
selection. For this reason, it is instructive to investigate our shrinkage prediction rule 
in combination with the HC thresholding procedure. The p-values underlying the HC 
objective function were obtained by fitting a two-component mixture model, so that the 
same empirical null model was used as in the FDR analysis. 

The results are given in Tab.]!] Again, in all cases the LDA approach using cat scores 
for feature selection leads to smaller prediction error than employing DDA and f-scores. 
Remarkably, the performance of FNDR and HC approach are on an equal level, implying 
that efficient feature selection is indeed possible within the FDR framework. The set of 
features selected by HC is on average a bit smaller than that chosen by FNDR, and larger 
than the FDR-based set, which indicates that the HC threshold is typically situated in 
the "buffer zone" of Fig.|l] 
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7 Discussion 



7.1 Shrinkage discriminant analysis and feature selection 

In this paper we have revisited high-dimensional shrinkage discriminant analysis and 
presented a very efficient procedure for prediction. Our approach contains three distinct 
elements: 

• the use of James-Stein shrinkage for training the predictor, 

• feature ranking based on cat scores, and 

• feature selection based on FNDR thresholding. 

Employing James-Stein shrinkage estimators is efficient both from a statistical as well as 
from a computational perspective. Note that shrinkage is used here only as a means to 
improve the estimated parameters, but not for model selection as in the approaches by 
Tibshirani et"ari ( |2002l l and |Guo et aT] ( |2007| . 



The correlation-adjusted t-score (cat score) emerges as a natural gene ranking crite- 
rion in the presence of correlation among predictors ( Zuber and Strimmer 2009| |. Here 



we have shown how to employ cat scores in the multi-class LDA setting and demon- 
strated on high-dimensional data that using cat scores rather than i-scores leads to 
more effective choice of predictors. We note that the order of ranking induced by the 
cat and i-scores, respectively, may differ substantially. Hence univariate thresholding 
procedures to select interesting features will differ, even if the testing procedures account 
for dependencies. 

Finally, we propose feature selection by controlling FNDR rather FDR and show that 
this is as efficient in terms of predictive accuracy as when "higher criticism" is employed. 
Moreover, we explain why variable selection based on FDR leads to inferior prediction 
rules. 



7.2 Recommendations 

For extremely high-dimensional data estimating correlation is very difficult, hence in 



this instance we recommend to conduct diagonal discriminant analysis (see also jBickel 



and Levina ( 2004[ |). From our analysis it is clear the shrinkage DDA as proposed here. 



combined with variable selection by control of FNDR or HC, is most effective. In 
contrast to the PAM approach no randomization procedures are involved and hence the 
prediction rule and the number of selected features is stable. 

In all other cases we recommend a full shrinkage LDA analysis, with feature selection 
based on cat scores. While this approach is computationally more expensive than the 
shrinkage DDA approach, it has a significant impact on predictive accuracy. Typically, 
in comparison with DDA taking account of correlation either leads to more compact 
feature sets or improved prediction error, or both. Furthermore, relative to competing 
full LDA approaches, such as Guo et al. ( 2007[ |, our procedure is computationally fast. 



due to the avoidance of computer-intensive procedures such as resampling. 
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Appendix A: James-Stein shrinkage estimators for training the 
LDA predictor 

For "small n, large p" inference of the LDA predictor function (Eqs.[l]and|2| and the cat 
score (Eq. |5]l we rely on three different James-Stein-type estimators. 

The correlation matrix is estimated by shrinking empirical correlations towards 
zero, 

with estimated intensity 

Ai = mm(l, — 2 — ) 

< |Schafer and Strimmer 2005] |. 



The variances are estimated by shrinking the empirical estimates vi towards their 
median. 



using 



A2 = mm(l, ) 

Zj(=i '-^i ^median^ 



( [Opgen-Rhein and Strimrrier 2007[ |. 



The class frequencies are estimated following Hausser and Strimmer ( |2009| l by 



i»shrink 



A3- + (l-A3)-f , 



using 



Appendix B: Relationship to other DDA and LDA approaches 

Our proposed shrinkage discriminant approach is closely linked to a number of recently 
proposed methods. 
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NSC 



The NSC / PAM classification rule was first presented in Tibshirani et al. ( 2002[ | and later 
discussed in more statistical detail in Tibshirani et al.| ( |2003| . PAM is a DDA approach, 
so no gene-wise correlations are taken into account. Genes are ranked according to 
Eq. [7| and feature selection is determined by soft-thresholding, using prediction error 
estimated by crossvalidation as optimality criterion. 



Ebay 

The "Ebay" approach of Efron 12008a') is also a DDA approach. Feature selection is based 
on an empirical Bayes model that links prediction error with false discovery rates. Thus, 
it is very similar to PAM but computationally and statistically more efficient. In addition, 
the "Ebay" algorithm provides correlation corrections of prediction errors, see Section 5 
in|Efrm|(^00^. 



Clanc and MLDA 

The "Clanc" algorithm is described in Dabney and Storey ( 2007') and the "modified 
LDA" (MLDA) in |Xu et al.] ( [2009] |. Both methods are based on the LDA framework, 
and both use James-Stein shrinkage to estimate the pooled covariance matrix. MLDA 
uses standard i-scores for feature selection, whereas Clanc employs a greedy algorithm 
search to find optimal subsets of features based on a multivariate criterion. 



SCRDA 



The "shrunken centroids regularized discriminant analysis" (SCRDA) procedure is 
described in Guo et al. ( 2007| l and uses a similar soft-thresholding procedure for variable 
selection as PAM. The covariance matrix is estimated by a ridge estimator. Regularization 
and feature selection parameters are simultaneously determined by cross-validation. 
The main issues with SCRDA are the computational expense and problems in finding 
unique parameters (Guo et al. 20071. 



Appendix C: Computer implementation 

We have implemented the proposed shrinkage discriminant procedures (both DDA and 
LDA) and the associated FNDR and higher criticism variable selection in the R package 
"sda", which is freely available under the terms of the GNU General Public License 
(version 3 or later) from CRAN ( http : //cran . r-proj ect . org/web/pa ckages /sdaT) . 
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